Communication Pattern Based Checkpointing Coordination for Fault-tolerant Distributed Computing Systems
نویسندگان
چکیده
This paper presents a new checkpointing coordination scheme which utilizes the communication pattern of the cooperating processes. In the proposed scheme, the checkpointing is coordinated for the limited number of processes based on the information regarding the communication pattern of the target program. Unlike the previous solutions which do not utilize the communication pattern, it is possible to reduce the coordination eeort as well as the checkpointing frequency. Extensive simulation has been performed to evaluate the performance of the proposed scheme and we concluded that the proposed scheme signiicantly reduces the checkpointing overhead compared with the loose coordination schemes.
منابع مشابه
Application controlled checkpointing coordination for fault-tolerant distributed computing systems
In order to provide fault tolerance for distributed systems, the checkpointing technique has widely been used and many researches have been performed to reduce the overhead of check-pointing coordination. In this paper, we present a new checkpointing coordination scheme in which the application controls the coordination activity by utilizing the communication pattern of the application program....
متن کاملAn Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment
Mobile computing systems are made up of different components among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environment. In the scheme suggested nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...
متن کاملStability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling is an important issue for large scale computational grid because of its unreliable nature of grid resources. Commonly exploited techniques to realize fault tolerance is periodic Checkpointing that periodically ...
متن کاملNew Paradigms in Check-pointing Techniques in Distributed Mobile Systems
Distributed systems today are ubiquitous and enable many applications, including client–server systems, transaction processing, the World Wide Web, and scientific computing, among many others. Distributed systems are not fault-tolerant and the vast computing potential of these systems is often hampered by their susceptibility to failures. Many techniques have been developed to add reliability a...
متن کاملComputing in the RAIN: a reliable array of independent nodes - Parallel and Distributed Systems, IEEE Transactions on
ÐThe RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2007